Preparation

set.seed(123)

# import data
clustering_raw <- read.csv("clustering.csv", header = FALSE)

# prepare data
clustering <- scale(clustering_raw)  # standardize variables

# look at data
plot(clustering)

The task is to cluster points on a 2D plane. The data was standardized for use in the algorithms. There are no large outliers, so they should not cause serious problems. In my opinion there should be 8 clusters.

(k_max <- round( sqrt( dim( clustering)[1])))
## [1] 20

This is the maximum predicted number of clusters. It is often set to the rounded square root of the number of points.

Algorithms

K-Means Cluster Analysis

First, k has to be chosen. With the k-means algorithm it should be chosen where the curve “flattens” (the elbow). I chose 5, 6, 7 and 8 groups. For simplicity I also used the same group counts for the hierarchical algorithms.
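The elbow heuristic above can be sketched as follows; this is a minimal illustration on hypothetical toy data, standing in for the scaled `clustering` matrix:

```r
# Elbow-method sketch: total within-cluster sum of squares for k = 1..k_max
# (toy two-blob data here is an assumption, not the report's data set)
set.seed(123)
toy <- rbind(matrix(rnorm(100), ncol = 2),
             matrix(rnorm(100, mean = 4), ncol = 2))
k_max <- 10
wss <- sapply(1:k_max, function(k)
  kmeans(toy, centers = k, nstart = 25)$tot.withinss)
plot(1:k_max, wss, type = "b",
     xlab = "number of clusters k", ylab = "total within-cluster SS")
# choose k near where the curve "flattens"
```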

The centers of the groups are marked with X.

K-means cannot handle non-globular clusters or clusters of different sizes and densities, although it can typically find pure subclusters if a large enough number of clusters is specified. K-means also has trouble clustering data that contains outliers. Outlier detection and removal can help significantly in such situations. Finally, K-means is restricted to data for which there is a notion of a center (centroid). Fortunately this data has none of these problems so K-means works very well.
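A minimal k-means sketch, again on hypothetical toy data with two centers (the report itself uses 5–8 groups), marking the centers with an X via `pch = 4`:

```r
# K-means sketch with cluster centers marked by "X"
set.seed(123)
toy <- rbind(matrix(rnorm(100), ncol = 2),
             matrix(rnorm(100, mean = 4), ncol = 2))
km <- kmeans(toy, centers = 2, nstart = 25)
plot(toy, col = km$cluster, pch = 20)
points(km$centers, pch = 4, cex = 2, lwd = 2)  # pch = 4 draws an X
```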

For the single link or MIN version of hierarchical clustering, the proximity of two clusters is defined as the minimum of the distance (maximum of the similarity) between any two points in the two different clusters. Using graph terminology, if you start with all points as singleton clusters and add links between points one at a time, shortest links first, then these single links combine the points into clusters. The single link technique is good at handling non-elliptical shapes, but is sensitive to noise and outliers. In this case it performed very poorly, because even small outliers were far enough away to end up in separate groups.
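The outlier sensitivity can be demonstrated with a small sketch (hypothetical toy data with one far-away point): single link puts the outlier in its own singleton cluster.

```r
# Single-link (MIN) sketch: one far outlier ends up as a singleton cluster
set.seed(123)
toy <- rbind(matrix(rnorm(60, sd = 0.5), ncol = 2),
             matrix(rnorm(60, mean = 5, sd = 0.5), ncol = 2),
             c(20, 20))                       # one isolated outlier
hc <- hclust(dist(toy), method = "single")
groups <- cutree(hc, k = 3)
table(groups)  # the outlier forms its own tiny group
```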

For the complete link or MAX version of hierarchical clustering, the proximity of two clusters is defined as the maximum of the distance (minimum of the similarity) between any two points in the two different clusters. Using graph terminology, if you start with all points as singleton clusters and add links between points one at a time, shortest links first, then a group of points is not a cluster until all the points in it are completely linked, i.e., form a clique. Complete link is less susceptible to noise and outliers, but it can break large clusters and it favors globular shapes.
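Complete link's preference for globular shapes can be sketched the same way; on two compact toy blobs (an assumption, not the report's data) it cleanly recovers both groups:

```r
# Complete-link (MAX) sketch on two globular toy blobs
set.seed(123)
toy <- rbind(matrix(rnorm(60, sd = 0.5), ncol = 2),
             matrix(rnorm(60, mean = 5, sd = 0.5), ncol = 2))
hc <- hclust(dist(toy), method = "complete")
groups <- cutree(hc, k = 2)
table(groups)  # each blob recovered as one cluster
```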

For Ward’s method, the proximity between two clusters is defined as the increase in the squared error that results when two clusters are merged. Thus, this method uses the same objective function as K-means clustering. While it may seem that this feature makes Ward’s method somewhat distinct from other hierarchical techniques, it can be shown mathematically that Ward’s method is very similar to the group average method when the proximity between two points is taken to be the square of the distance between them.
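Because Ward's method shares its objective with k-means, the two partitions usually agree on well-separated data. A sketch on toy blobs (note that in `hclust` the variant `ward.D2` is the one meant for plain Euclidean distances):

```r
# Ward sketch: hclust's "ward.D2" merges the pair of clusters that gives the
# smallest increase in total squared error; compare with k-means
set.seed(123)
toy <- rbind(matrix(rnorm(60, sd = 0.5), ncol = 2),
             matrix(rnorm(60, mean = 5, sd = 0.5), ncol = 2))
hc_ward <- hclust(dist(toy), method = "ward.D2")
groups <- cutree(hc_ward, k = 2)
km <- kmeans(toy, centers = 2, nstart = 25)
table(ward = groups, kmeans = km$cluster)  # same split up to label swap
```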

DBSCAN

From the dbscan library. Density-based clustering locates regions of high density that are separated from one another by regions of low density. DBSCAN is a simple and effective density-based clustering algorithm that illustrates a number of concepts important for any density-based clustering approach. Its strength lies in not being sensitive to outliers: DBSCAN marks them as a separate noise group (marked with x). eps is its main hyperparameter.
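To show what eps and the noise labelling do, here is a minimal base-R sketch of the DBSCAN idea (the report itself uses the dbscan package; the function name and toy data below are illustrative assumptions):

```r
# Minimal DBSCAN-style sketch in base R, illustrating eps, minPts and noise.
# Label 0 marks noise points (the "x" points in the report's plots).
dbscan_sketch <- function(x, eps, minPts = 5) {
  d <- as.matrix(dist(x))
  labels <- rep(0L, nrow(x))
  cluster <- 0L
  for (i in seq_len(nrow(x))) {
    if (labels[i] != 0L) next
    neighbors <- which(d[i, ] <= eps)          # includes the point itself
    if (length(neighbors) < minPts) next       # not a core point
    cluster <- cluster + 1L
    labels[i] <- cluster
    queue <- setdiff(neighbors, i)
    while (length(queue) > 0) {                # expand the cluster
      j <- queue[1]; queue <- queue[-1]
      if (labels[j] == 0L) {
        labels[j] <- cluster
        nj <- which(d[j, ] <= eps)
        if (length(nj) >= minPts)              # only core points expand further
          queue <- union(queue, nj[labels[nj] == 0L])
      }
    }
  }
  labels
}

set.seed(123)
toy <- rbind(matrix(rnorm(60, sd = 0.3), ncol = 2),
             matrix(rnorm(60, mean = 5, sd = 0.3), ncol = 2),
             c(20, 20))                        # one far outlier
lab <- dbscan_sketch(toy, eps = 1, minPts = 4)
# the two blobs become clusters 1 and 2; the outlier keeps label 0 (noise)
```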

Metrics

I used 3 metrics for comparison. It is important to compare metrics for different algorithms only at the same number of groups, because the number of groups affects the metric values.

Connectivity

It measures how connected the cluster partitions are, i.e. to what degree neighbouring observations end up in the same cluster. The connectivity has a value between zero and ∞ and should be minimized.
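A base-R sketch of this score, following the definition in the clValid manual: for each point, add 1/j whenever its j-th nearest neighbour falls in another cluster, so a clean separation scores exactly zero. The helper name and toy data are assumptions for illustration.

```r
# Connectivity sketch: penalty 1/j for every j-th nearest neighbour that is
# assigned to a different cluster; lower is better, 0 is a clean separation
connectivity_sketch <- function(x, clusters, neighbSize = 10) {
  d <- as.matrix(dist(x))
  total <- 0
  for (i in seq_len(nrow(x))) {
    nn <- order(d[i, ])[2:(neighbSize + 1)]   # nearest neighbours, self excluded
    total <- total + sum((clusters[nn] != clusters[i]) / seq_len(neighbSize))
  }
  total
}

set.seed(123)
toy <- rbind(matrix(rnorm(60, sd = 0.5), ncol = 2),
             matrix(rnorm(60, mean = 10, sd = 0.5), ncol = 2))
km <- kmeans(toy, centers = 2, nstart = 25)
connectivity_sketch(toy, km$cluster, neighbSize = 5)  # 0 for this clean split
```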

Dunn Index

The Dunn Index is the ratio of the smallest distance between observations not in the same cluster to the largest intra-cluster distance. It is said to reflect a worst-case scenario. The Dunn Index has a value between zero and ∞, and should be maximized.
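The ratio above is short enough to sketch directly in base R; on well-separated toy blobs (an assumption for illustration) the index comes out well above 1:

```r
# Dunn index sketch: min between-cluster distance / max within-cluster distance
dunn_sketch <- function(x, clusters) {
  d <- as.matrix(dist(x))
  same <- outer(clusters, clusters, "==")
  off  <- upper.tri(d)                  # each pair counted once, no diagonal
  min(d[off & !same]) / max(d[off & same])
}

set.seed(123)
toy <- rbind(matrix(rnorm(60, sd = 0.5), ncol = 2),
             matrix(rnorm(60, mean = 10, sd = 0.5), ncol = 2))
cl <- rep(1:2, each = 30)
dunn_sketch(toy, cl)   # well above 1 for compact, distant blobs
```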

The Silhouette Width

The Silhouette Width is the average of each observation’s Silhouette value. The Silhouette value measures the degree of confidence in the clustering assignment of a particular observation, with well-clustered observations having values near 1 and poorly clustered observations having values near −1.
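This quantity can be computed with the `cluster` package (shipped with R as a recommended package; clValid relies on the same silhouette computation). A sketch on toy blobs:

```r
# Average silhouette width sketch using cluster::silhouette
library(cluster)
set.seed(123)
toy <- rbind(matrix(rnorm(60, sd = 0.5), ncol = 2),
             matrix(rnorm(60, mean = 10, sd = 0.5), ncol = 2))
km <- kmeans(toy, centers = 2, nstart = 25)
sil <- silhouette(km$cluster, dist(toy))
mean(sil[, "sil_width"])   # near 1 for well-separated clusters
```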

Commentary on the metrics and how they behave

The best connectivity is achieved by hierarchical clustering with the Ward method for 5 clusters. I think this is because each point then has many neighbours in the same group.

The best Dunn Index is achieved by DBSCAN with eps = 1.9. As you can see, the groups are very compact and well separated.

The best Silhouette Width is achieved by k-means with 5 groups. I think the centers are placed in the best spots for low within-group variance.

Bibliography

I used some text from:

https://www-users.cs.umn.edu/~kumar001/dmbook/ch8.pdf

and

clValid library manual